Text mining from pdf files with Tesseract and pdftools

Optical Character Recognition (OCR) for copying text from pdf files

lruolin
10-15-2021

Background

I was about to key chunks of information into my Excel spreadsheet at work this week. I usually update the spreadsheet with what I read in books, journals and trade magazines, so that I can easily retrieve information about a flavor chemical just by looking up its CAS number. I learnt to do this the hard way. When I first started on my job, I was giving reports on the flavor composition of different foods to colleagues who were not familiar with flavors. I was not much better, as I had only just started, and even typing in the chemical names without typos was a learning task for me. So that everyone would know what each flavor chemical was, I copied and pasted the descriptions individually for the 50 or 60 plus flavor chemicals, every time I had to share a report. I spent far more time copying and pasting than actually interpreting the report and familiarising myself with the chemicals!

Finally I realised I should have an Excel spreadsheet to store the information. My boss is very experienced in this field and has already committed all the different chemicals and their properties to memory, but I was just starting out and had few bytes in my memory.

I built up my database by manually typing in what I came across, until I learnt that Flavorbase could be exported in Excel format, and I joined the two together. Later, colleagues in other locations sent me their databases and I merged those in as well.

Now, I want to add the information from Fenaroli’s Handbook of Flavor Ingredients to my database. Should I manually type in over 2000 pages of information? Or is there a better way?

A search on Google resulted in a revelation: R has a package called tesseract for optical character recognition! I had used tabulizer before, but there were some hits and misses, and I often had to do a lot of cleaning before the data could be used.

Tesseract

I was looking at the vignette intro for the package and decided to try it out for myself. My text cleaning and string manipulation skills can be further improved, but I had a good first attempt at two of the pages!

pdftools

However, I realised I had a problem extracting data stored in tables in the PDF file. This led me to the pdftools package. The problem with this package is that I could not split the text into sections the way I did with the Tesseract output.

Learning points:

Procedure

I followed the steps listed on: https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html

Load package
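The code in the rest of this post calls functions from stringr, tibble, tidyr, dplyr and purrr (all attached by the tidyverse), plus tesseract, pdftools and naniar. A minimal sketch of a setup chunk, assuming those are the packages intended; the check for missing packages is an optional extra:

```r
# Packages inferred from the function calls used later in this post.
pkgs <- c("tidyverse", "tesseract", "pdftools", "naniar")

# List anything that is not installed yet, instead of failing on library().
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
}

# Attach only the packages that are available.
for (p in setdiff(pkgs, missing_pkgs)) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Anything reported as missing can be installed from CRAN with install.packages().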

Define file

I chose a page where an entry's information is printed in full on that single page. This is the ideal scenario, chosen just to demonstrate that the approach works for me.

Convert pdf file to png file

A screenshot of the file is shown below.

Screenshot of file

Importing the file:

compound_png <- pdftools::pdf_convert("test.pdf", dpi = 600) # test.pdf should be saved in your working directory
Converting page 1 to test_1.png... done!

Convert to text

Use optical character recognition on the png file.

Concatenate and print (cat) the file.

text <- tesseract::ocr(compound_png)

cat(text) # to see what was captured
ALLYL SORBATE 67
DFE
ALLYL PROPYL DISULFIDE
Synonyms: Allyl propyl disulphate; Disulfide, 2-propenyl propyl; Disulfide, allyl propyl; 2-Propenyl propyl disulfide; Propeny]
propyl disulfide; 4,5-Dithia-1-octene; Propyl allyl disulfide
[CoBNo.: [600 [EINECS No.: [218-550-7__[JECFANo.: [1700 |
Description: Colorless to yellowish liquid; fruity, garlic aroma.
Consumption: Odor and/or flavor used in cabbage, tropical fruit, garlic, leek, and onion Annual: n/a Individual: n/a
Regulatory Status:

CoE: n/a

FDA: n/a

FDA (other): n/a

JECFA: ADI: Acceptable. No safety concern at current levels of intake when used as a flavoring agent (2007).
Trade association guidelines: FEMA PADI 0.091 mg IOFI: n/a
Empirical Formula/MW:

C.H,,S,/148.29 ee NSN

Specifications: (JECFA, 2008)
Reported uses (ppm): (FEMA, 2005)
Synthesis: n/a
Aroma threshold values: High strength odor, sulfurous type; recommend smelling in a 0.10% solution or less.
Taste threshold values: Taste like that of cooked onions.
Natural occurrence: Reported as the chief volatile constituent in onion oil and found in raw cabbage, chive, garlic oil, leek and
onion.
DFE
ALLYL SORBATE
Synonyms: Allyl-2,4-hexadienoate; Allyl hexa-2,4-dieonoate; Allyl sorbate; 2-Propenyl sorbate; 2,4-Hexadienoic acid, 2-prope-
nyl ester, (E,E)-; (E, E)-2-Propenyl 2,4-hexadienoate; 2,4-hexadienoic acid, 2-propen-l-yl ester (2E, 4E)-
[CoE No.: [2182 [EINECS No.: [231-336-8 JECFANo: [8 |
Description: Allyl sorbate is a colorless liquid with a fruital pineapple-like odor.
Consumption: Annual: <1.00 Ib Individual: 0.00000061 mg/kg/day
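One thing to notice in the OCR text is that the identifier table came through badly: the bracketed line contains an EINECS number but no "CAS No." label at all. A minimal sketch, using that line copied from the output above, of what a digits-digits-digits pattern would pick up from it:

```r
library(stringr)

# Identifier line exactly as Tesseract returned it for the first compound.
ocr_line <- "[CoBNo.: [600 [EINECS No.: [218-550-7__[JECFANo.: [1700 |"

# Generic "digits-digits-digits" pattern, the usual shape of a CAS number.
cas_like <- "[[:digit:]]+-[[:digit:]]+-[[:digit:]]+"

has_cas_label <- str_detect(ocr_line, fixed("CAS No"))
found <- str_extract_all(ocr_line, cas_like)[[1]]

has_cas_label  # FALSE
found          # "218-550-7"
```

The only hit is the EINECS number, which happens to have the same shape. The CAS number has to come from somewhere else, which is where pdftools comes in.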

Text cleaning

Text patterns - defined

# caps
upper_case_pattern <- "\\b[A-Z]+\\b"  # CAPS

# caps before the word synonyms
caps_before_synon_pattern <-"\\b[A-Z]+\\b.+(?<=Synonyms)"


# define cas number pattern
cas <- "[[:digit:]]+-[[:digit:]]+-[[:digit:]]+"  # digits - digits - digits

# description: between description and consumption
description_pattern <- "(?<=Description: ).+(?= Consumption)"

# Consumption: between Consumption and Regulatory
consumption_pattern <- "(?<=Consumption: ).+(?= Regulatory)"


# Aroma: between values and Taste
aroma_pattern <- "(?<=values: ).+(?= Taste threshold)"

# Taste: between values and Natural
taste_pattern <- "(?<=Taste threshold values: ).+(?= Natural)"

text_clean <-  text %>% 
  str_replace_all("\\n", " ") %>%  # flatten the page into one long string
  str_split_fixed(" DFE ", 4) %>%  # split into sections by DFE, directly on the string
  as_tibble(.name_repair = "unique") %>%  # one column per section: ...1, ...2, ...
  pivot_longer(everything()) %>%  # one row per section
  mutate(no_of_char = map_dbl(value, str_length)) %>% 
  filter(no_of_char > 50) %>%  # to set threshold to filter out irrelevant ones
  mutate(compound = str_extract_all(value, caps_before_synon_pattern,
                                    simplify = TRUE),
         compound_clean = str_trim(str_replace_all(compound, "Synonyms", "")),
         description = str_extract_all(value, description_pattern,
                                       simplify = TRUE),
         consumption = str_extract_all(value, consumption_pattern,
                                       simplify = TRUE),
         aroma = str_extract_all(value, aroma_pattern,
                                 simplify = TRUE),
         taste = str_extract_all(value, taste_pattern,
                                 simplify = TRUE))
  
text_clean
# A tibble: 2 × 9
  name  value  no_of_char compound[,1]  compound_clean description[,1]
  <chr> <chr>       <dbl> <chr>         <chr>          <chr>          
1 ...2  "ALLY…       1107 ALLYL PROPYL… ALLYL PROPYL … Colorless to y…
2 ...3  "ALLY…        449 ALLYL SORBAT… ALLYL SORBATE  Allyl sorbate …
# … with 3 more variables: consumption <chr[,1]>, aroma <chr[,1]>,
#   taste <chr[,1]>
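The patterns above all have the same shape: a lookbehind for the label that starts a field, `.+` for the content, and a lookahead for the label that starts the next field. Because `.+` is greedy, this only works when each string holds a single entry, which is why the text is split on DFE first. A small sketch with a made-up string (the entries "A." and "B." are placeholders, not from the handbook):

```r
library(stringr)

# Two made-up entries left in one string, i.e. NOT yet split on " DFE ".
two_entries <- "Description: A. Consumption: x DFE Description: B. Consumption: y"

# Greedy: runs on to the LAST " Consumption" in the string.
greedy <- str_extract(two_entries, "(?<=Description: ).+(?= Consumption)")

# Lazy: stops at the FIRST " Consumption".
lazy <- str_extract(two_entries, "(?<=Description: ).+?(?= Consumption)")

greedy  # "A. Consumption: x DFE Description: B."
lazy    # "A."
```

Splitting into one entry per row, as done above, avoids the problem. Switching to `.+?` would be an extra safeguard if a section ever contained the same label twice.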

Using pdftools

library(pdftools)
library(naniar)

table_text <- pdf_text("test.pdf")
# extract first one after compound name

table_pdf <- table_text %>% 
  str_split_fixed(., "\n\n\n", 4) %>% 
  as_tibble() %>% 
  pivot_longer(everything()) %>% 
  mutate(desc = str_squish(value)) %>% 
  mutate(no_of_char = map_dbl(value, str_length)) %>% 
  mutate(compound = str_extract_all(desc, caps_before_synon_pattern,
                                    simplify = T),
         compound_clean = str_trim(str_replace_all(compound, "Synonyms", ""))) %>% 
  select(-compound, value) %>% 
  mutate(cas = str_extract_all(desc, "(?=CAS No.: ).+(?=FL No.)",
                               simplify = T),
         cas_cleaned = str_trim(str_replace_all(cas, "CAS No.:", ""))) %>% 
  select(-cas) %>% 
  mutate(fema = str_extract_all(desc, "(?=FEMA No.: ).+(?= NAS)",
                                simplify = T),
         fema_cleaned = str_trim(str_replace_all(fema, 
                                                 "FEMA No.:", "" ))) %>%
  select(-fema) %>% 
  replace_with_na_at(.vars = c("cas_cleaned", "fema_cleaned"),
                     condition = ~.x == "") %>% 
  
  # remove entries without cas number
  filter(!is.na(cas_cleaned))

table_pdf
# A tibble: 2 × 7
  name  value desc  no_of_char compound_clean cas_cleaned fema_cleaned
  <chr> <chr> <chr>      <dbl> <chr>          <chr>       <chr>       
1 V2    "\nA… ALLY…       1187 ALLYL PROPYL … 2179-59-1   4073        
2 V4    "\nA… ALLY…        684 ALLYL SORBATE  7493-75-6   2041        

As I am unable to split the pdftools text into sections, I only extract the compound name, CAS number and FEMA number from it.
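The CAS and FEMA extraction above matches the label together with the number and then strips the label in a second step. A possible alternative, sketched on a made-up row, is to use a lookbehind so that only the number is returned. The row layout and the FL and NAS values here are placeholders; only the CAS and FEMA numbers are taken from the output above.

```r
library(stringr)

# Hypothetical squished table row; "00.000" and "0000" are placeholder values.
row <- "CAS No.: 2179-59-1 FL No.: 00.000 FEMA No.: 4073 NAS No.: 0000"

# Two-step approach: match label + number, then remove the label.
two_step <- str_trim(str_replace_all(
  str_extract(row, "(?=CAS No.: ).+(?=FL No.)"), "CAS No.:", ""))

# One-step approach: lookbehind for the label, then digits and hyphens only.
cas_direct  <- str_extract(row, "(?<=CAS No\\.: )[0-9-]+")
fema_direct <- str_extract(row, "(?<=FEMA No\\.: )[0-9]+")

two_step     # "2179-59-1"
cas_direct   # "2179-59-1"
fema_direct  # "4073"
```

Both routes give the same CAS number here. The lookbehind version also escapes the dot in "No." so it matches a literal full stop rather than any character.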

Join two tables

table_pdf
# A tibble: 2 × 7
  name  value desc  no_of_char compound_clean cas_cleaned fema_cleaned
  <chr> <chr> <chr>      <dbl> <chr>          <chr>       <chr>       
1 V2    "\nA… ALLY…       1187 ALLYL PROPYL … 2179-59-1   4073        
2 V4    "\nA… ALLY…        684 ALLYL SORBATE  7493-75-6   2041        
text_clean
# A tibble: 2 × 9
  name  value  no_of_char compound[,1]  compound_clean description[,1]
  <chr> <chr>       <dbl> <chr>         <chr>          <chr>          
1 ...2  "ALLY…       1107 ALLYL PROPYL… ALLYL PROPYL … Colorless to y…
2 ...3  "ALLY…        449 ALLYL SORBAT… ALLYL SORBATE  Allyl sorbate …
# … with 3 more variables: consumption <chr[,1]>, aroma <chr[,1]>,
#   taste <chr[,1]>
merged <- text_clean %>% 
  left_join(table_pdf, by = "compound_clean",
            suffix = c("_text", "_table")
            ) %>% 
  select(value_text, compound_clean, cas_cleaned, fema_cleaned, 
         description, consumption, aroma, taste) %>% 
  map_df(., str_squish)

glimpse(merged)
Rows: 2
Columns: 8
$ value_text     <chr> "ALLYL PROPYL DISULFIDE Synonyms: Allyl propy…
$ compound_clean <chr> "ALLYL PROPYL DISULFIDE", "ALLYL SORBATE"
$ cas_cleaned    <chr> "2179-59-1", "7493-75-6"
$ fema_cleaned   <chr> "4073", "2041"
$ description    <chr> "Colorless to yellowish liquid; fruity, garli…
$ consumption    <chr> "Odor and/or flavor used in cabbage, tropical…
$ aroma          <chr> "High strength odor, sulfurous type; recommen…
$ taste          <chr> "Taste like that of cooked onions.", ""

The allyl sorbate entry is cut off on this page, so fields such as its taste come back empty and the text mining for it is not complete.
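Empty fields are easy to spot, but a row that fails to join is harder to notice. The join relies on the OCR text and the pdftools text giving exactly the same compound name, so a single misread character would leave the CAS and FEMA columns as NA. A minimal check, sketched with made-up data (the misread name "ALLYL S0RBATE" is hypothetical and did not occur in the output above):

```r
library(dplyr)

# Stand-ins for text_clean and table_pdf; the descriptions are made up.
ocr_side <- tibble(
  compound_clean = c("ALLYL PROPYL DISULFIDE", "ALLYL S0RBATE"),  # zero instead of O
  description    = c("made-up text A", "made-up text B")
)
table_side <- tibble(
  compound_clean = c("ALLYL PROPYL DISULFIDE", "ALLYL SORBATE"),
  cas_cleaned    = c("2179-59-1", "7493-75-6")
)

joined    <- left_join(ocr_side, table_side, by = "compound_clean")
unmatched <- anti_join(ocr_side, table_side, by = "compound_clean")

unmatched$compound_clean  # "ALLYL S0RBATE"
```

Any rows in `unmatched` are OCR entries whose name has no partner in the pdftools table and would need fixing by hand or by fuzzy matching.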

Still, this is a great start for my scraping!

My next step is to try scraping more pages and see how I can merge the data together.
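One way this next step could look is to wrap the cleaning in a function that takes the text of one page and returns one row per entry, then apply it to every page and stack the results. This is only a sketch under assumptions: `parse_page()` and its `min_chars` argument are hypothetical names, the two page strings are made up, and only two fields are extracted to keep it short.

```r
library(stringr)

# Hypothetical helper: one page of text in, one row per entry out.
parse_page <- function(page_text, min_chars = 50) {
  flat <- str_replace_all(page_text, "\\n", " ")
  sections <- str_split(flat, " DFE ")[[1]]
  sections <- sections[str_length(sections) > min_chars]  # drop headers and fragments
  data.frame(
    compound    = str_extract(sections, "^.+?(?= Synonyms)"),
    description = str_extract(sections, "(?<=Description: ).+?(?= Consumption)"),
    stringsAsFactors = FALSE
  )
}

# Made-up stand-ins for OCR output, one string per page.
pages <- c(
  "PAGE HEADER 1\nDFE\nCOMPOUND A\nSynonyms: a1; a2\nDescription: Made-up description of A.\nConsumption: n/a",
  "PAGE HEADER 2\nDFE\nCOMPOUND B\nSynonyms: b1; b2\nDescription: Made-up description of B.\nConsumption: n/a"
)

# Parse every page and stack the rows into one data frame.
all_entries <- do.call(rbind, lapply(pages, parse_page))
all_entries$compound  # "COMPOUND A" "COMPOUND B"
```

With real data, `pages` would be the character vector returned by running the OCR step on each converted PNG file. Entries that run over a page break, like allyl sorbate above, would still need separate handling.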

Appendix

Code chunk for trying out, before adding to the final cleaning step:

# let me try on test text first
test_text <- text_clean %>% 
  filter(no_of_char > 1000) %>% 
  select(value) %>% 
  pull()

test_text
[1] "ALLYL PROPYL DISULFIDE Synonyms: Allyl propyl disulphate; Disulfide, 2-propenyl propyl; Disulfide, allyl propyl; 2-Propenyl propyl disulfide; Propeny] propyl disulfide; 4,5-Dithia-1-octene; Propyl allyl disulfide [CoBNo.: [600 [EINECS No.: [218-550-7__[JECFANo.: [1700 | Description: Colorless to yellowish liquid; fruity, garlic aroma. Consumption: Odor and/or flavor used in cabbage, tropical fruit, garlic, leek, and onion Annual: n/a Individual: n/a Regulatory Status:  CoE: n/a  FDA: n/a  FDA (other): n/a  JECFA: ADI: Acceptable. No safety concern at current levels of intake when used as a flavoring agent (2007). Trade association guidelines: FEMA PADI 0.091 mg IOFI: n/a Empirical Formula/MW:  C.H,,S,/148.29 ee NSN  Specifications: (JECFA, 2008) Reported uses (ppm): (FEMA, 2005) Synthesis: n/a Aroma threshold values: High strength odor, sulfurous type; recommend smelling in a 0.10% solution or less. Taste threshold values: Taste like that of cooked onions. Natural occurrence: Reported as the chief volatile constituent in onion oil and found in raw cabbage, chive, garlic oil, leek and onion."
# To define different text patterns
str_extract_all(test_text,"(?<=Taste threshold values: ).+(?= Natural)")
[[1]]
[1] "Taste like that of cooked onions."
text_clean %>% 
  filter(compound_clean == "ALLYL SORBATE") %>% 
  select(value) %>% 
  pull()
[1] "ALLYL SORBATE Synonyms: Allyl-2,4-hexadienoate; Allyl hexa-2,4-dieonoate; Allyl sorbate; 2-Propenyl sorbate; 2,4-Hexadienoic acid, 2-prope- nyl ester, (E,E)-; (E, E)-2-Propenyl 2,4-hexadienoate; 2,4-hexadienoic acid, 2-propen-l-yl ester (2E, 4E)- [CoE No.: [2182 [EINECS No.: [231-336-8 JECFANo: [8 | Description: Allyl sorbate is a colorless liquid with a fruital pineapple-like odor. Consumption: Annual: <1.00 Ib Individual: 0.00000061 mg/kg/day "

Reference:

Citation

For attribution, please cite this work as

lruolin (2021, Oct. 15). pRactice corner: Text mining from pdf files with Tesseract and pdftools. Retrieved from https://lruolin.github.io/myBlog/posts/20211015 Text mining with tesseract/

BibTeX citation

@misc{lruolin2021text,
  author = {lruolin, },
  title = {pRactice corner: Text mining from pdf files with Tesseract and pdftools},
  url = {https://lruolin.github.io/myBlog/posts/20211015 Text mining with tesseract/},
  year = {2021}
}